Business Context

Action for Happiness is a global movement aimed at increasing well-being and contentment in people everywhere. They offer tools, resources, and courses based on the most recent research in science aimed at helping individuals live happier lives.

I, Patrick Chaccour, a hired data consultant, was handed a dataset of country rankings based on population happiness. These rankings consider a variety of characteristics, including GDP per capita, healthy life expectancy, generosity, and freedom, among others. With this dataset I will conduct an exploratory data analysis (EDA) to help Action for Happiness reach its objectives.

The organization's stakeholders, such as policymakers, researchers, and members of the general public interested in promoting happiness on a worldwide scale, are the intended audience.

The goal of the EDA is to provide information about the relationship between these parameters and the overall happiness of populations in various countries. Specifically, by examining the data, we will identify the countries that need particular interventions. For instance, if the study reveals that a given location has low levels of happiness, Action for Happiness can create specialized programs or efforts to meet those needs. This EDA will give Action for Happiness evidence and insights and will enable them to make data-driven decisions. They can also use the pipeline to look into trends, test theories, and assess the efficacy of various approaches.

Importing Libraries

In [1]:
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

import plotly.graph_objects as go

Loading & prepping the dataset

In [2]:
df_originial = pd.read_csv('2021.csv')
df_originial.head()
Out[2]:
Country name Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
0 Finland Western Europe 7.842 0.032 7.904 7.780 10.775 0.954 72.0 0.949 -0.098 0.186 2.43 1.446 1.106 0.741 0.691 0.124 0.481 3.253
1 Denmark Western Europe 7.620 0.035 7.687 7.552 10.933 0.954 72.7 0.946 0.030 0.179 2.43 1.502 1.108 0.763 0.686 0.208 0.485 2.868
2 Switzerland Western Europe 7.571 0.036 7.643 7.500 11.117 0.942 74.4 0.919 0.025 0.292 2.43 1.566 1.079 0.816 0.653 0.204 0.413 2.839
3 Iceland Western Europe 7.554 0.059 7.670 7.438 10.878 0.983 73.0 0.955 0.160 0.673 2.43 1.482 1.172 0.772 0.698 0.293 0.170 2.967
4 Netherlands Western Europe 7.464 0.027 7.518 7.410 10.932 0.942 72.4 0.913 0.175 0.338 2.43 1.501 1.079 0.753 0.647 0.302 0.384 2.798

Columns Explained:

  • Ladder Score: the national average answer to the survey question "How would you rate your happiness on a scale of 0 to 10?"
  • Standard error of ladder score: the sampling uncertainty of the ladder score, which acts as a margin of error
  • upperwhisker and lowerwhisker: the upper and lower bounds of the confidence interval around the ladder score (roughly the score plus or minus 1.96 standard errors), not the highest and lowest individual answers
  • Logged GDP per capita: the log transformation is applied to GDP per capita to compress the scale and limit the influence of extreme values. The resulting score is a measure of relative economic well-being; higher values indicate higher GDP per capita.
  • Social Support: the share of respondents (between 0 and 1) who report having someone to count on in times of trouble
  • Healthy Life Expectancy: the average number of years a person in that country is expected to live in good health
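The whisker columns appear to track the ladder score plus or minus about 1.96 standard errors, which is what a 95% confidence interval would give. The following is a minimal check of that reading using only the Finland and Denmark rows printed above; the 1.96 multiplier is an assumption, and small differences are expected because the printed values are rounded.

```python
# Values copied from the Finland and Denmark rows of the head() output above.
rows = {
    "Finland": {"ladder": 7.842, "se": 0.032, "upper": 7.904, "lower": 7.780},
    "Denmark": {"ladder": 7.620, "se": 0.035, "upper": 7.687, "lower": 7.552},
}

Z_95 = 1.96  # assumed multiplier for a 95% normal confidence interval

max_gap = 0.0
for country, r in rows.items():
    est_upper = r["ladder"] + Z_95 * r["se"]
    est_lower = r["ladder"] - Z_95 * r["se"]
    # Track the largest difference between the estimate and the reported whisker
    max_gap = max(max_gap,
                  abs(est_upper - r["upper"]),
                  abs(est_lower - r["lower"]))
    print(f"{country}: estimated [{est_lower:.3f}, {est_upper:.3f}], "
          f"reported [{r['lower']:.3f}, {r['upper']:.3f}]")

print(f"Largest gap between estimate and reported whisker: {max_gap:.4f}")
```

If the gap stays within rounding error for other rows as well, the whiskers can be treated as interval bounds rather than as observed extremes.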

Cleaning the Dataset:

  • The dataset is ranked from the happiest country to the least happy. Because our objective is to help the least happy countries, we reverse the row order so that they come first.

  • Drop all unnecessary columns.

  • Undo the log transform on GDP per capita so the values are easier to read and compare.

  • Create sub-datasets.

In [3]:
#Flipping the Dataset
df = df_originial[::-1].reset_index(drop=True)

#Dropping all unnecessary columns
columns_to_drop = ['Generosity', 'Ladder score in Dystopia', 'Explained by: Log GDP per capita',
                   'Explained by: Social support', 'Explained by: Healthy life expectancy', 
                   'Explained by: Freedom to make life choices', 'Explained by: Perceptions of corruption', 
                   'Dystopia + residual']
df = df.drop(columns_to_drop, axis=1)

# Undo the log transform on GDP per capita (suggested by ChatGPT); this assumes a natural log
df['Logged GDP per capita'] = np.exp(df['Logged GDP per capita'])

#Renaming Columns
df = df.rename(columns= {'Standard error of ladder score': 'Standard margin of Error',
                         'Ladder score': 'Happiness Score', 
                         'upperwhisker': 'Highest Score',
                         'lowerwhisker': 'Lowest Score',
                         'Logged GDP per capita': 'GPD per capita',
                         'Freedom to make life choices': 'Freedom',
                         'Explained by: Generosity': 'Perception of Generosity'})

df.head()
Out[3]:
Country name Regional indicator Happiness Score Standard margin of Error Highest Score Lowest Score GPD per capita Social support Healthy life expectancy Freedom Perceptions of corruption Perception of Generosity
0 Afghanistan South Asia 2.523 0.038 2.596 2.449 2197.333810 0.463 52.493 0.382 0.924 0.122
1 Zimbabwe Sub-Saharan Africa 3.145 0.058 3.259 3.030 2815.795236 0.750 56.201 0.677 0.821 0.157
2 Rwanda Sub-Saharan Africa 3.415 0.068 3.548 3.282 2155.978587 0.552 61.400 0.897 0.167 0.227
3 Botswana Sub-Saharan Africa 3.467 0.074 3.611 3.322 17712.041536 0.784 59.269 0.824 0.801 0.027
4 Lesotho Sub-Saharan Africa 3.512 0.120 3.748 3.276 2768.331303 0.787 48.700 0.715 0.915 0.103
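As a sanity check on the un-logging step, the sketch below applies `np.exp` to Finland's logged value from the first table and confirms that `np.log` returns the original number. It also logs Afghanistan's un-logged value from the cleaned table. This assumes the source column uses the natural logarithm; the notebook does not state the base explicitly.

```python
import numpy as np

logged_gdp = 10.775          # Finland's "Logged GDP per capita" from the first table
gdp = np.exp(logged_gdp)     # undo the log, assuming it was a natural log
round_trip = np.log(gdp)     # logging again should return the original value

# Afghanistan's un-logged value from the cleaned table, logged again
afghanistan_logged = np.log(2197.333810)

print(f"exp({logged_gdp}) = {gdp:,.0f}")
print(f"log of that value = {round_trip:.3f}")
print(f"log(2197.333810) = {afghanistan_logged:.3f}")
```

If the base were 10 instead, `10 ** logged_gdp` would be needed and the resulting dollar figures would be implausibly large, which is one informal way to check the assumption.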
In [4]:
# Asked ChatGPT to create a subset of the first 10 and last 10 rows of the flipped dataset (10 least happy, 10 happiest)
first_10_rows = df.head(10)
last_10_rows = df.tail(10)

subset = pd.concat([first_10_rows, last_10_rows])
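The subset above relies on the rows already being sorted by happiness. A minimal alternative sketch, which does not depend on row order, selects the extremes by score with `nsmallest` and `nlargest`. The small frame below is illustrative only, with scores copied from the tables above, and `n = 2` stands in for the notebook's 10.

```python
import pandas as pd

# Tiny illustrative frame; scores are taken from the tables above.
scores = pd.DataFrame({
    "Country name": ["Finland", "Afghanistan", "Denmark", "Zimbabwe", "Rwanda"],
    "Happiness Score": [7.842, 2.523, 7.620, 3.145, 3.415],
})

n = 2  # the notebook uses 10; 2 keeps the example small

# Pick the extremes by value rather than by position in the frame
least_happy = scores.nsmallest(n, "Happiness Score")
most_happy = scores.nlargest(n, "Happiness Score")

extremes = pd.concat([least_happy, most_happy]).reset_index(drop=True)
print(extremes)
```

This gives the same kind of subset even if the input file arrives in a different order in a future year.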

Data Quality

We must examine the data for missing values, outliers, and discrepancies. It is also critical to grasp the data types and variable distributions.

Data Types:
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Country name               149 non-null    object 
 1   Regional indicator         149 non-null    object 
 2   Happiness Score            149 non-null    float64
 3   Standard margin of Error   149 non-null    float64
 4   Highest Score              149 non-null    float64
 5   Lowest Score               149 non-null    float64
 6   GPD per capita             149 non-null    float64
 7   Social support             149 non-null    float64
 8   Healthy life expectancy    149 non-null    float64
 9   Freedom                    149 non-null    float64
 10  Perceptions of corruption  149 non-null    float64
 11  Perception of Generosity   149 non-null    float64
dtypes: float64(10), object(2)
memory usage: 14.1+ KB
Checking for Missing Values
In [6]:
df.isnull().sum()
Out[6]:
Country name                 0
Regional indicator           0
Happiness Score              0
Standard margin of Error     0
Highest Score                0
Lowest Score                 0
GPD per capita               0
Social support               0
Healthy life expectancy      0
Freedom                      0
Perceptions of corruption    0
Perception of Generosity     0
dtype: int64
Checking for Outliers & Inconsistencies
In [7]:
# To obtain summary statistics, select only the numeric columns of the DataFrame
# and call describe() on the resulting numeric-only frame.

numeric_columns = df.select_dtypes(include=[np.number])
numeric_columns.describe()
Out[7]:
Happiness Score Standard margin of Error Highest Score Lowest Score GPD per capita Social support Healthy life expectancy Freedom Perceptions of corruption Perception of Generosity
count 149.000000 149.000000 149.000000 149.000000 149.000000 149.000000 149.000000 149.000000 149.000000 149.000000
mean 5.532839 0.058752 5.648007 5.417631 21560.608440 0.814745 64.992799 0.791597 0.727450 0.178047
std 1.073924 0.022001 1.054330 1.094879 20908.784656 0.114889 6.762043 0.113332 0.179226 0.098270
min 2.523000 0.026000 2.596000 2.449000 761.279066 0.463000 48.478000 0.382000 0.082000 0.000000
25% 4.852000 0.043000 4.991000 4.706000 5120.462265 0.750000 59.802000 0.718000 0.667000 0.105000
50% 5.534000 0.054000 5.625000 5.413000 14314.095070 0.832000 66.603000 0.804000 0.781000 0.164000
75% 6.255000 0.070000 6.344000 6.128000 33556.974347 0.905000 69.600000 0.877000 0.845000 0.239000
max 7.842000 0.173000 7.904000 7.780000 114347.804564 0.983000 76.953000 0.970000 0.939000 0.541000
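The summary statistics alone do not say whether any value counts as an outlier. One common convention, used here as an assumption rather than as the notebook's own method, is the 1.5 × IQR rule. The sketch applies it to the 'GPD per capita' quartiles printed in the table above.

```python
# Quartiles and extremes copied from the describe() output for 'GPD per capita'.
q1, q3 = 5120.462265, 33556.974347
col_min, col_max = 761.279066, 114347.804564

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value beyond a fence is flagged as a potential outlier
has_low_outliers = col_min < lower_fence
has_high_outliers = col_max > upper_fence

print(f"IQR: {iqr:,.0f}")
print(f"Fences: [{lower_fence:,.0f}, {upper_fence:,.0f}]")
print(f"Low outliers present: {has_low_outliers}")
print(f"High outliers present: {has_high_outliers}")
```

Under this rule the maximum lies above the upper fence, so at least one very rich country would be flagged, while nothing falls below the lower fence. Flagged values here are real economies rather than data errors, so they are worth noting, not removing.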
Analyzing Variable Distributions
In [8]:
numeric_columns.hist(bins=20, figsize=(10, 6))

#Asked ChatGpt to Make the histograms fit properly, without overlapping
plt.tight_layout()

plt.show()

Explanatory Data Visualization

To extract insights from data, we can utilize visualization techniques such as scatter plots, bar charts, box plots, heatmaps, and geographical maps. Each insight should be accompanied by appropriate visualizations and a compelling story that illustrates why the insight is important to the business.

Distribution of countries across different regions.

In [9]:
# Isolate the first 50 rows: after flipping, these are the 50 least happy countries
df_top50 = df[:50]

# Pie chart showing which regions most of the least happy countries belong to
fig, ax = plt.subplots(figsize=(6, 6))
df_top50['Regional indicator'].value_counts().plot(kind='pie', ax=ax)
plt.title('Distribution of top 50 Saddest Countries by Region')
plt.show()

This pie chart indicates that most of the least happy countries lie in Sub-Saharan Africa, the Middle East and North Africa, and South Asia, which means Action for Happiness should focus its projects on these regions.

However, to help these regions, we must find out what is driving their low scores. We will compare sad and happy countries to understand the differences and identify what could be addressed.

The impact of Freedom on Happiness

In [10]:
# Used ChatGPT to add a trendline to the scatter plot:
# fit a straight line (degree-1 polynomial) to Freedom vs Happiness Score
coefficients = np.polyfit(df['Freedom'], df['Happiness Score'], 1)
m = coefficients[0]  # slope
c = coefficients[1]  # intercept

# Customized the scatter plot
plt.scatter(df['Freedom'], df['Happiness Score'], color='blue')
plt.plot(df['Freedom'], m*df['Freedom'] + c, color='red', label='Trendline')
plt.xlabel('Freedom to make life choices (share of respondents)')
plt.ylabel('Happiness Score')
plt.title('Does Freedom affect happiness?')
plt.legend()

# plt.show must be called with parentheses to render the figure
plt.show()

The data reveals a positive relationship between freedom to make life choices and overall happiness, as shown by the upward trend line. It is worth noting, however, that some countries report high levels of freedom while their happiness scores remain relatively low. This implies that factors other than freedom also influence a country's overall well-being.
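To put a number on the trend, a correlation coefficient can be computed alongside the fitted line. The sketch below uses only the ten rows visible in the tables above (the five happiest and five least happy countries), so its result is illustrative and should not be read as the value for the full 149-country dataset.

```python
import numpy as np

# Freedom and happiness score for the ten rows printed in the tables above:
# Finland, Denmark, Switzerland, Iceland, Netherlands,
# Afghanistan, Zimbabwe, Rwanda, Botswana, Lesotho
freedom = np.array([0.949, 0.946, 0.919, 0.955, 0.913,
                    0.382, 0.677, 0.897, 0.824, 0.715])
happiness = np.array([7.842, 7.620, 7.571, 7.554, 7.464,
                      2.523, 3.145, 3.415, 3.467, 3.512])

# Straight-line fit, same call as the notebook's trendline
slope, intercept = np.polyfit(freedom, happiness, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(freedom, happiness)[0, 1]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
```

On the full dataset the same two calls would take `df['Freedom']` and `df['Happiness Score']` as inputs.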

Life Expectancy in Happier Countries vs Sadder Countries

In [11]:
#Isolated rows from datasets to get the 10 saddest and 10 happiest countries
table1 = subset['Country name'].head(10)
table2 = df_originial['Country name'].head(10)

#Customization
table1_title = 'Top 10 Sadder countries'
table2_title = 'Top 10 Happier countries'

#Displayed
print(table1_title)
print(table1)
print('\n' + table2_title)
print(table2)
Top 10 Sadder countries
0    Afghanistan
1       Zimbabwe
2         Rwanda
3       Botswana
4        Lesotho
5         Malawi
6          Haiti
7       Tanzania
8          Yemen
9        Burundi
Name: Country name, dtype: object

Top 10 Happier countries
0        Finland
1        Denmark
2    Switzerland
3        Iceland
4    Netherlands
5         Norway
6         Sweden
7     Luxembourg
8    New Zealand
9        Austria
Name: Country name, dtype: object

The output above lists the 10 saddest and 10 happiest countries in 2021 according to the dataset.

I decided to put both ends of the ranking into a single bar chart for easier comparison. We are comparing healthy life expectancy in countries such as Afghanistan and Lesotho with that in countries such as Switzerland and Iceland.

In [12]:
# Asked ChatGPT to help customize the colors on the bar plot:
# red for healthy life expectancy below 65 years, green otherwise
custom_palette = ["red" if y < 65 else "green" for y in subset['Healthy life expectancy']]

ax = sns.barplot(x='Country name', y='Healthy life expectancy', data=subset, palette=custom_palette)

# Rotate the country names on the x axis so they fit
plt.xticks(rotation=45)

plt.xlabel('Country')
plt.ylabel('Healthy life expectancy (years)')
plt.title('Life expectancy for top 10 Saddest & Happiest countries', fontsize=14)

# Fit the layout after all labels are set, then display
plt.tight_layout()
plt.show()

The comparison of healthy life expectancy in happy and sad countries indicates a significant gap. According to the graph, the happiest countries have healthy life expectancies above 70 years, whilst the saddest countries have much lower ones, averaging around 55 years. This disparity suggests that the less happy countries may face difficult conditions such as limited access to food, water, and proper medical treatment. This observation points to the possibility of poverty-related problems within these countries.

Hence, we must look into the economic conditions of these regions.
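The size of the gap can be checked with a simple group average. The sketch below uses only the ten rows visible in the tables above (five happiest, five least happy), so the figures are a partial illustration and not the averages for the full top and bottom ten; the `group` column is a label introduced here for the example.

```python
import pandas as pd

# Healthy life expectancy for the ten rows printed in the tables above.
sample = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Switzerland", "Iceland", "Netherlands",
                     "Afghanistan", "Zimbabwe", "Rwanda", "Botswana", "Lesotho"],
    "group": ["happiest"] * 5 + ["least happy"] * 5,  # illustrative label
    "Healthy life expectancy": [72.0, 72.7, 74.4, 73.0, 72.4,
                                52.493, 56.201, 61.400, 59.269, 48.700],
})

# Average healthy life expectancy within each group
group_means = sample.groupby("group")["Healthy life expectancy"].mean()

# Difference in years between the two group averages
gap = group_means["happiest"] - group_means["least happy"]

print(group_means.round(1))
print(f"Gap: {gap:.1f} years")
```

Applied to the notebook's `subset`, the same `groupby` would need a comparable group label, for example one derived from row position.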

GDP per capita per Region

In [13]:
# Created a list of every Region on the Data Set
Regions = df['Regional indicator'].unique().tolist()
Regions
Out[13]:
['South Asia',
 'Sub-Saharan Africa',
 'Latin America and Caribbean',
 'Middle East and North Africa',
 'Southeast Asia',
 'Commonwealth of Independent States',
 'Central and Eastern Europe',
 'East Asia',
 'Western Europe',
 'North America and ANZ']
In [14]:
#Used ChatGPT to help set all the boxplots in a grid
num_regions = len(Regions)
num_cols = 2  
num_rows = (num_regions - 1) // num_cols + 1 

# Set the characteristics of the subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, num_rows * 4))

# Used ChatGPT to create this loop to plot boxplots for each region
for i, region in enumerate(Regions):
    df_iso = df.query('`Regional indicator` == @region')
    row = i // num_cols  
    col = i % num_cols  
    ax = axes[row][col]   
    sns.boxplot(x='GPD per capita', data=df_iso, ax=ax)
    ax.set_title(f'Boxplot for GPD per capita - {region}')
    
# Display
plt.tight_layout()
plt.show()

Further examination of the box plot for Sub-Saharan Africa, which the pie chart identified as the saddest region, reveals exceptionally low GDP per capita values, with a median of roughly 3,000 dollars and the majority lying between 2,000 and 5,000 dollars. In better-off regions such as Western Europe, on the other hand, GDP per capita is centered at roughly 52,000 dollars, with the bulk falling between 40,000 and 58,000 dollars.
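The central line of each box plot is the regional median, which can also be computed directly with a `groupby`. The sketch below uses only the Sub-Saharan African and Western European rows visible in the tables above, so the medians it prints are for this small sample and will differ from the full-region figures. The Western European values are un-logged with `np.exp`, assuming a natural log.

```python
import numpy as np
import pandas as pd

# Western Europe: logged GDP per capita from the first table, un-logged here.
western_europe_gdp = list(np.exp([10.775, 10.933, 11.117, 10.878, 10.932]))

# Sub-Saharan Africa: un-logged values copied from the cleaned table.
sub_saharan_gdp = [2815.795236, 2155.978587, 17712.041536, 2768.331303]

sample = pd.DataFrame({
    "Country name": ["Zimbabwe", "Rwanda", "Botswana", "Lesotho",
                     "Finland", "Denmark", "Switzerland", "Iceland", "Netherlands"],
    "Regional indicator": ["Sub-Saharan Africa"] * 4 + ["Western Europe"] * 5,
    "GPD per capita": sub_saharan_gdp + western_europe_gdp,  # column name as in the notebook
})

# Median GDP per capita within each region
regional_median = sample.groupby("Regional indicator")["GPD per capita"].median()
print(regional_median.round(0))
```

On the full frame, `df.groupby('Regional indicator')['GPD per capita'].median()` would give one number per region to quote next to the box plots.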

In conclusion, the findings suggest that poverty is strongly associated with a country's overall unhappiness. The observed differences in economic measures, such as GDP per capita, between unhappy and happy regions underline the important role that socioeconomic factors play in shaping well-being outcomes.

In [15]:
#Asked Chatgpt to help colorcode and display the figure
fig = go.Figure(data=go.Choropleth(locations=df['Country name'],
    z=df['GPD per capita'],
    locationmode='country names',
    colorscale='YlOrRd_r',
    colorbar_title='GPD per capita',))

fig.update_layout(title='GPD per capita by Country',
                  geo=dict(showframe=False, showcoastlines=False,),)

fig.show()

The choropleth map above depicts the global distribution of GDP per capita. The color scale ranges from dark red to bright yellow, with darker shades signifying very low GDP per capita (below 5,000 dollars) and brighter colors denoting rich economies (over 100,000 dollars).

A close look at the map reveals that nations in Africa and South Asia have darker hues, signifying notably low GDP per capita. Countries in Europe, on the other hand, show tones of orange and yellow, signifying substantially higher economic well-being, with GDP per capita above 50,000 dollars.

By showing these differences, the map makes the striking contrast in economic prosperity between regions easy to see.

However, we still need to know whether such low or high levels of GDP per capita are linked to governmental corruption.

Level of Corruption

In [16]:
# Average perception of corruption per country
# (each country appears once in this dataset, so the mean equals the single reported value)
average_corruption = df.groupby('Country name')['Perceptions of corruption'].mean().reset_index()

# Creating the bar plot
fig = go.Figure(data=go.Bar(x=average_corruption['Country name'], y=average_corruption['Perceptions of corruption']))

# Customization
fig.update_layout(title='Level of Corruption Worldwide', xaxis_title='Country', yaxis_title='Perceptions of corruption')

# Display
fig.show()

The plot provides information about the prevalence of corruption in various countries. Each bar represents a country, and the height of the bar symbolizes the country's average level of perceived corruption. Higher bars imply a higher perception of corruption, whereas lower bars indicate a lesser perception.

We can identify countries with substantially higher or lower degrees of corruption by inspecting the plot. This image helps us comprehend the global distribution and differences in corruption perceptions, offering useful information for future analysis and research on this global topic.

As we can see, perceived corruption is widespread, and it is difficult to tell from this plot alone whether it plays the main role in a country's unhappiness.

Social Support in Sad Countries vs Happier Countries

In [17]:
sns.barplot(x=subset['Country name'], y=subset['Social support'])

# Rotate x labels so they can fit
plt.xticks(rotation=45)

# Customization
plt.xlabel('Country')
plt.ylabel('Social support')
plt.title('Social support by Country')

# Display
plt.tight_layout()
plt.show()

The data reveals a clear gap in social support between the ten happiest and ten saddest countries. Compared with the happy countries, the sad countries have noticeably lower levels of social support; in the most extreme case shown in the tables above, Afghanistan (0.463) reports roughly half the level of Finland (0.954). This disparity suggests that the lack of adequate support structures in the sadder countries may contribute to their overall unhappiness.

Final Discussion and Conclusion

As a data consultant for Action for Happiness, I conducted an exploratory data analysis to uncover insights about population well-being and the factors that contribute to it. Several major findings resulted from this analysis:

  • Strengths and Limitations: The dataset employed in this investigation, which includes many aspects of happiness and associated factors, is one of its strengths. Visualizations such as bar graphs, scatter plots, and choropleth maps aided in the comprehension of the data. Furthermore, identifying geographical patterns and comparing happy and sad countries provided useful information.

However, the dataset is based on self-reported data, and the approach identifies correlations rather than causation. Further study and data collection would help build a more comprehensive understanding of the factors that influence happiness.

  • Insights and Implications: The analysis points to a positive association between happiness scores and GDP per capita, social support, and healthy life expectancy. This suggests that improving economic conditions, building social support networks, and strengthening healthcare infrastructure could all have a favorable impact on overall happiness levels.

Furthermore, the identification of specific regions with lower happiness scores, such as Sub-Saharan Africa and South Asia, emphasizes the importance of focused interventions in these countries. Investing in poverty reduction, education, and healthcare programs could boost people's well-being and happiness in these areas.

  • Data-Driven Recommendations:

Socioeconomic Development: Work with governments, non-governmental organizations, and local communities to promote socioeconomic development, with the goal of increasing GDP per capita.

Healthcare and Well-Being: Advocate for better medical systems and facilities. This could include collaborations with healthcare groups, policy advocacy, and public awareness campaigns.

Intervention: Concentrate on designing and implementing tailored intervention programs in regions with lower happiness levels, taking into account each region's unique socioeconomic and cultural environment.

Constant Data Monitoring: By running an EDA pipeline every year, we help Action for Happiness track progress over time. They can gather and analyze data on a regular basis to evaluate the success of their initiatives and track changes in happiness levels. This monitoring enables them to adjust their plans based on up-to-date evidence and ensure they are having a beneficial influence on the well-being of individuals.
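One way to make the yearly pipeline repeatable is to wrap the cleaning steps in a single function that can be applied to each year's file. The following is a simplified sketch, not the notebook's full cleaning cell: the function name `prepare_happiness_data` is hypothetical, only three of the steps are reproduced, and it assumes future files keep the 2021 column names and a natural-log GDP column.

```python
import numpy as np
import pandas as pd


def prepare_happiness_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply a reduced version of the notebook's cleaning steps (hypothetical helper)."""
    # Reverse the ranking so the least happy countries come first
    cleaned = raw.iloc[::-1].reset_index(drop=True)
    # Undo the log transform on GDP per capita (assumes a natural log)
    cleaned["GDP per capita"] = np.exp(cleaned["Logged GDP per capita"])
    cleaned = cleaned.drop(columns=["Logged GDP per capita"])
    # Use a friendlier name for the score column
    return cleaned.rename(columns={"Ladder score": "Happiness Score"})


# Three rows copied from the head() output of the 2021 file, for illustration.
raw_2021 = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Switzerland"],
    "Ladder score": [7.842, 7.620, 7.571],
    "Logged GDP per capita": [10.775, 10.933, 11.117],
})

yearly = prepare_happiness_data(raw_2021)
print(yearly)
```

In practice the input would come from `pd.read_csv` on each year's file, and the cleaned frames could then be compared across years.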

Action for Happiness can help to create a happier world by addressing major variables that influence happiness and increasing the well-being of individuals and communities by implementing these recommendations.

Submission Form

In [18]:
from IPython.display import Image

Image(filename='DATA VISUALIZATION.jpg')
Out[18]: